Introduction to Generalized Linear Models

ESA and Physalia Collaboration
4th March 2024

Dr Philip Leftwich

Welcome!

physalia logo

About me

Associate Professor in Data Science and Genetics at the University of East Anglia.


Academic background in … .


Social media at .

UEA logo

Outline

  • Why be normal?

  • GLM with binary data

  • GLM with count data

What to expect during this workshop

These workshops will run for 2 hours each.

  • Combines slides, live coding examples, and exercises for you to participate in.

  • Ask questions in the chat throughout!

What to expect during this workshop


I hope you end up with more questions than answers after this workshop!


Schitts Creek questions gif

Source: giphy.com

What you need for this workshop

  • You are comfortable with simple operations in R.

  • You know how to perform linear regression.

Introduce yourselves

https://blog.slido.com/virtual-icebreakers/

Workshop resources

Data

Guidotti, E., Ardia, D., (2020), “COVID-19 Data Hub”, Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.

Data

How to download:

covid <- readr::read_csv(
  "https://raw.githubusercontent.com/nrennie/f4sg-gams/main/data/covid.csv"
  )


  • See data/ folder on GitHub for pre-processing.

General linear models

Recapping general linear models

OLS

Limitations of general linear models

Q. What are the assumptions of a general linear model?

  • Assumes a linear relationship

  • Assumes normal distribution of the residuals

  • Assumes homogeneity of the residuals

Model assumptions

The equation of a linear model (lm) is given by:

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

where:

  • \(y_i\) is the predicted value of the response variable.

  • \(\beta_0\) is the intercept.

  • \(\beta_1\) is the slope coefficient, representing the change in the response variable for a one-unit change in the explanatory variable.

  • \(x_i\) is the value of the explanatory variable.

  • \(\epsilon_i\) represents the residuals of the model

The linear regression line is the most likely line given your data, if we assume each data point is drawn from a hypothetical bell curve centred on the regression line.

probability

Normal distribution

Normal distribution of the residuals

When using our residuals to calculate standard errors, confidence intervals and statistical significance, they are assumed to be drawn from a normal distribution with mean zero and a constant variance.

This implies that the residuals, or the distances between the observed values and the values predicted by the linear model, can be modeled by drawing random values from a normal distribution.

The linear model equation

Another way to write the lm equation is:

\[ y_i \sim N(\mu = \beta_0 + \beta_1 X_i, \sigma^2) \]

Which literally means that \(y_i\) is drawn from a normal distribution with parameters \(\mu\) (which depends on \(x_i\)) and \(\sigma\) (which has the same value for all measures of \(Y\)).
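This formulation can be checked by simulation: drawing \(y_i\) from a normal distribution whose mean depends on \(x_i\) recovers the regression coefficients. A minimal sketch, with illustrative parameter values:

```r
# Simulate data from y_i ~ N(beta_0 + beta_1 * x_i, sigma^2)
# (beta_0 = 2, beta_1 = 3, sigma = 1.5 are illustrative values)
set.seed(1)
n <- 200
x <- runif(n)
y <- rnorm(n, mean = 2 + 3 * x, sd = 1.5)

# Refitting with lm() recovers estimates close to the true values
coef(lm(y ~ x))
```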

Real data

How often do these assumptions hold true?

Figures: a good model fit; count data; the Challenger data.

Assumption testing

Residual plots?

# Check multicollinearity
car::vif(mod)
# Breusch-Pagan test of homoscedasticity
lmtest::bptest(mod)

# Breusch-Pagan test against the Pearson residuals
sresid <- residuals(janka_gamma, type = "pearson")
lmtest::bptest(janka_gamma, ~ sresid)


# Shapiro-Wilk test for normality of the residuals
shapiro.test(residuals(model))

What are Generalised linear models?

Theoretical foundations

Probability distributions and exponential families

A Linear Model Example

Janka timber hardness

This regression analysis uses data from the Australian forestry industry recording wood density and timber hardness from 36 different samples

  • Timber hardness is quantified on the “Janka” scale

  • The ‘amount of force required to embed a 0.444” steel ball into the wood to half of its diameter’.

Fitting the general linear model

janka_ls <- lm(hardness ~ dens, data = janka)

summary(janka_ls)
janka |> 
  ggplot(aes( x = dens, y = hardness))+
  geom_point()+
  geom_smooth(method = "lm")

Call:
lm(formula = hardness ~ dens, data = janka)

Residuals:
    Min      1Q  Median      3Q     Max 
-338.40  -96.98  -15.71   92.71  625.06 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1160.500    108.580  -10.69 2.07e-12 ***
dens           57.507      2.279   25.24  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 183.1 on 34 degrees of freedom
Multiple R-squared:  0.9493,    Adjusted R-squared:  0.9478 
F-statistic:   637 on 1 and 34 DF,  p-value: < 2.2e-16

Challenge

Open exercises/ for prompts.

  • Fit the Linear Model lm(hardness ~ dens, data = janka) to the janka data.

  • Evaluate the model fit

10:00

See exercise_solutions for full solution.

Model diagnostics

plot(janka_ls, which=2)
plot(janka_ls, which=3)

Formal tests

Violations of assumptions

library(lmtest)

# Breusch-Pagan Test of Homoscedasticity
lmtest::bptest(janka_ls)

    studentized Breusch-Pagan test

data:  janka_ls
BP = 4.4668, df = 1, p-value = 0.03456
# Shapiro-Wilk test for normality of residuals
shapiro.test(residuals(janka_ls))

    Shapiro-Wilk normality test

data:  residuals(janka_ls)
W = 0.93641, p-value = 0.03933

Challenge

Open exercises/ for prompts.

  • What can we try to improve the fit of our lm(hardness ~ dens, data = janka)

  • Let’s make suggestions then try them out

15:00

See exercise_solutions for full solution.

BoxCox

Transformations

janka_sqrt <- lm(sqrt(hardness) ~ dens, data = janka)

plot(janka_sqrt, which=2)
plot(janka_sqrt, which=3)

Model fit

ggplot(janka, aes(x = hardness, y = dens)) +
  geom_point() +  # scatter plot of original data points
  geom_smooth(method = "lm", formula = y ~ sqrt(x)) +  # regression line
  labs(title = "Sqrt Linear Regression with ggplot2",
       x = "X", y = "Y")  # axis labels and title

Residual plots

janka_log <- lm(log(hardness) ~ dens, data = janka)

plot(janka_log, which=2)
plot(janka_log, which=3)

Polynomials

janka_poly <- lm(log(hardness) ~ poly(dens, 2), data = janka)

plot(janka_poly, which=2)
plot(janka_poly, which=3)

Summary polynomial

summary(janka_poly)

Call:
lm(formula = log(hardness) ~ poly(dens, 2), data = janka)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.22331 -0.05708 -0.01104  0.07500  0.18871 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      7.1362     0.0168 424.896  < 2e-16 ***
poly(dens, 2)1   3.3862     0.1008  33.603  < 2e-16 ***
poly(dens, 2)2  -0.5395     0.1008  -5.354 6.49e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1008 on 33 degrees of freedom
Multiple R-squared:  0.9723,    Adjusted R-squared:  0.9706 
F-statistic: 578.9 on 2 and 33 DF,  p-value: < 2.2e-16

Troublesome transformations

ggplot(janka, aes(x = hardness, y = dens)) +
  geom_point() +  # scatter plot of original data points
  geom_smooth(method = "lm", formula = (y ~ log(x))) +  # regression line
  labs(title = "Log Linear Regression",
       x = "X", y = "Y")  # axis labels and title
ggplot(janka, aes(x = hardness, y = dens)) +
  geom_point() +  # scatter plot of original data points
  geom_smooth(method = "lm", formula = (y ~ poly(log(x), 2))) +  # regression line
  labs(title = "Quadratic Log Linear Regression",
       x = "X", y = "Y")  # axis labels and title

Weighted least squares

weights = 1/sqrt(hardness): This argument specifies the weights used in the fitting process. Here, the weights are inversely proportional to the square root of hardness.

1/sqrt(hardness): Observations with higher hardness values receive lower weights, and observations with lower hardness values receive higher weights, down-weighting the more variable large observations. data = janka specifies the data frame supplying sqrt(hardness) and dens.

Putting it all together, janka_wls <- lm(sqrt(hardness) ~ dens, weights = 1/sqrt(hardness), data = janka) fits a weighted least squares (WLS) regression: the response is the square root of hardness, the predictor is dens, and each observation is weighted inversely to the square root of its hardness.

janka_wls <- lm(sqrt(hardness) ~ dens, weights = 1/sqrt(hardness), data = janka)

plot(janka_wls, which=2)
plot(janka_wls, which=3)

prediction_data <- data.frame(dens = sort(unique(janka$dens)))
predictions <- predict(janka_wls, newdata = prediction_data, interval = "confidence", level = 0.95)

# Adding predictions and confidence intervals to the dataframe
prediction_data$wls_pred = predictions[, "fit"]
prediction_data$wls_pred.lwr = predictions[, "lwr"]
prediction_data$wls_pred.upr = predictions[, "upr"]

ggplot(janka) +
     geom_ribbon(data = prediction_data, aes(x = dens, ymin = wls_pred.lwr, ymax = wls_pred.upr), alpha = 0.8, fill = "lightgray")+
    geom_line(data = prediction_data, aes(x = dens, y = wls_pred), color = "blue")+
  geom_point(aes(x = dens, y = sqrt(hardness)))

Troublesome transformations

See screenshots

A Generalized linear model approach

janka_glm <- glm(hardness ~ dens, data = janka, family = gaussian(link = "sqrt"))

summary(janka_glm)

Call:
glm(formula = hardness ~ dens, family = gaussian(link = "sqrt"), 
    data = janka)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.60869    1.47660   1.767   0.0863 .  
dens         0.75194    0.02722  27.622   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 25498.81)

    Null deviance: 22485041  on 35  degrees of freedom
Residual deviance:   866960  on 34  degrees of freedom
AIC: 471.38

Number of Fisher Scoring iterations: 4

Plot model

ggplot(janka, aes(x = dens, y = hardness)) +
  geom_point() +  # scatter plot of original data points
  geom_smooth(method = "glm", method.args = list(family = gaussian(link = "sqrt"))) +  # regression line
  labs(title = "Linear Regression with ggplot2",
       x = "X", y = "Y")  # axis labels and title

Residuals

janka_glm <- glm(hardness ~ dens, data = janka, family = gaussian(link = "sqrt"))

plot(janka_glm, which=2)
plot(janka_glm, which=3)

Gamma distribution

Univariate analysis

summary statistics
------
min:  413   max:  3260 
median:  1195 
mean:  1469.472 
estimated sd:  801.5172 
estimated skewness:  0.6579162 
estimated kurtosis:  2.571426 

Fitting of the distribution ' gamma ' by maximum likelihood 
Parameters : 
        estimate Std. Error
shape 12.4371477  2.8925891
rate   0.3368412  0.0799412
Loglikelihood:  -134.6402   AIC:  273.2805   BIC:  276.4475 
Correlation matrix:
          shape      rate
shape 1.0000000 0.9799729
rate  0.9799729 1.0000000
Fitting of the distribution ' norm ' by maximum likelihood 
Parameters : 
     estimate Std. Error
mean 36.92229    1.71769
sd   10.30614    1.21459
Loglikelihood:  -135.0604   AIC:  274.1208   BIC:  277.2879 
Correlation matrix:
     mean sd
mean    1  0
sd      0  1

Gamma GLM

janka_gamma <- glm(hardness ~ dens, data = janka, family = Gamma(link = "sqrt"))

summary(janka_gamma)

Call:
glm(formula = hardness ~ dens, family = Gamma(link = "sqrt"), 
    data = janka)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.86729    0.90357   2.067   0.0465 *  
dens         0.76780    0.02243  34.224   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.009716208)

    Null deviance: 11.26788  on 35  degrees of freedom
Residual deviance:  0.32876  on 34  degrees of freedom
AIC: 452.97

Number of Fisher Scoring iterations: 4

Plot model

ggplot(janka, aes(x = dens, y = hardness)) +
  geom_point() +  # scatter plot of original data points
  geom_smooth(method = "glm", method.args = list(family = Gamma(link = "sqrt"))) +  # regression line
  labs(title = "Linear Regression with ggplot2",
       x = "X", y = "Y")  # axis labels and title

Exercise

Challenge

Compare

library(MuMIn)

#r.squaredLR()

# r squared

Likelihood

Log-Likelihood

Maximum Likelihood

Deviance

Pseudo R-squared

AIC

Likelihood ratio Test

Challenge?

GLMs for binary data

Common response variable in ecological datasets is the binary variable: we observe a phenomenon X or its “absence”

  • Presence/Absence of a species

  • Presence/Absence of a disease

  • Success/Failure to observe behaviour

  • Survival/Death of organisms

Wish to determine if \(P/A \sim \text{Variable}\)

Called a logistic regression or logit model

Binary variables

  • Investigating mayfly abundance in relation to pollution gradient measured in Cumulative Criterion Units (CCU)

  • Mayflies serve as indicators of stream health due to sensitivity to metal pollution

  • Higher CCU values indicate increased metal pollution, potentially impacting mayfly presence or absence in the stream

Example

If I just fit a regular linear model to this data:

ggplot(Mayflies, aes(x=CCU, y=Occupancy)) + geom_point()+
  geom_smooth(method = "lm")
ggplot(Mayflies, aes(x=CCU, y=Occupancy)) + geom_point()+
  geom_smooth(method = "glm", method.args = list(family = binomial(link = "logit"))) +  # regression line
  labs(title = "Logit-link Binomial Regression",
       x = "X", y = "Y")  # axis labels and title

Bernoulli Distribution

The Bernoulli distribution models the probability of success or failure in a single trial of a binary experiment, where success occurs with probability \(p\) and failure with probability \(1-p\).

Probability distribution

\(logit(p) = \log \frac{p}{1 - p}\)

\(y_i \sim \text{Bernoulli}(p)\)

Mean of distribution Probability (p) of observing an outcome

Variance of observed responses As observed responses approach 0 or 1, the variance of the distribution decreases

Probability, odds, logit-odds
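These three scales interconvert directly. Base R's qlogis() and plogis() implement the logit and inverse-logit transformations:

```r
# Probability -> odds -> log-odds, and back again
p <- 0.75
odds <- p / (1 - p)        # 3: "3 to 1 odds for"
log_odds <- log(odds)      # ~1.10; identical to qlogis(p)

# plogis() is the inverse logit: back to the probability scale
p_back <- plogis(log_odds) # 0.75
```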

Example

mayfly_glm <- glm(Occupancy ~ CCU, family = binomial(link = "logit"), data = Mayflies)

summary(mayfly_glm)

Call:
glm(formula = Occupancy ~ CCU, family = binomial(link = "logit"), 
    data = Mayflies)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)    5.102      2.369   2.154   0.0313 *
CCU           -3.051      1.211  -2.520   0.0117 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.795  on 29  degrees of freedom
Residual deviance: 12.649  on 28  degrees of freedom
AIC: 16.649

Number of Fisher Scoring iterations: 7

Each coefficient corresponds to a predictor variable, indicating the change in the log odds of the response variable associated with a one-unit change in that predictor, holding other predictors constant.

Interpret the model

  • Interpretation of coefficients involves exponentiating them to obtain odds ratios.

  • An odds ratio greater than 1 implies higher odds of the event occurring with an increase in the predictor.

  • An odds ratio less than 1 implies lower odds of the event occurring with an increase in the predictor.

Intepret the model

| Probability | Odds | Log Odds | Verbiage |
|-------------|------|----------|----------|
| \(p=.95\) | \(\frac{95}{5} = \frac{19}{1} = 19\) | 2.94 | 19 to 1 odds for |
| \(p=.75\) | \(\frac{75}{25} = \frac{3}{1} = 3\) | 1.10 | 3 to 1 odds for |
| \(p=.50\) | \(\frac{50}{50} = \frac{1}{1} = 1\) | 0 | 1 to 1 odds |
| \(p=.25\) | \(\frac{25}{75} = \frac{1}{3} = 0.33\) | -1.10 | 3 to 1 odds against |
| \(p=.05\) | \(\frac{5}{95} = \frac{1}{19} = 0.0526\) | -2.94 | 19 to 1 odds against |

Logit odds

When we use a binomial model, we work on the 'log-odds' scale. This gives a fully unconstrained linear regression: odds are bounded between 0 and \(\infty\) (with anything less than 1 meaning odds against), but log-odds can take any value from \(-\infty\) to \(\infty\).

\[\log\left(\frac{p}{1-p}\right) = \beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}\]

\[\frac{p}{1-p} = e^{\beta_{0}}e^{\beta_{1}x_{1}}e^{\beta_{2}x_{2}}\]

  • We can interpret \(\beta_{1}\) and \(\beta_{2}\) as the increase in the log odds for every unit increase in \(x_{1}\) and \(x_{2}\).

  • We could alternatively interpret \(\beta_{1}\) and \(\beta_{2}\) multiplicatively: a one-unit change in \(x_{1}\) multiplies the odds by \(e^{\beta_{1}}\).

For the intercept:

  • The estimate on the log-odds scale is 5.102.

  • Therefore the odds are \(e^{5.102} \approx 164\).

  • Over 99% probability of observing mayfly when CCU = 0.

For the predictor variable CCU:

  • The estimate is -3.051.

  • For every one-unit increase in CCU, the log odds of the response variable decrease by 3.051, holding all other predictors constant.

Now, let's interpret the coefficient for CCU using the odds ratio interpretation:

  • The odds ratio associated with CCU is \(e^{\beta_{\text{CCU}}} = e^{-3.051} \approx 0.047\).

  • This means that for every one-unit increase in CCU, the odds of the response variable are multiplied by 0.047.
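These back-transformations can be reproduced directly from the printed coefficients (values copied from the model summary above; exp() and plogis() are base R):

```r
# Coefficients as reported in the mayfly model summary
b0    <- 5.102   # intercept, log-odds scale
b_ccu <- -3.051  # slope for CCU, log-odds scale

exp(b0)       # odds of occupancy at CCU = 0 (roughly 164)
exp(b_ccu)    # odds ratio: each unit of CCU multiplies the odds by ~0.047
plogis(b0)    # probability of occupancy at CCU = 0 (> 0.99)
```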

Challenge

Open exercises/ for prompts.

  • Work with the odds and probability calculators

  • Check how comfortable you are with additive log-odds, multiplicative odds and changing probability

10:00

See exercise_solutions for full solution.

Model fit

For a simple Bernoulli/Binary GLM there are only a few checks that apply:

Model fit

Model fit

DHARMa fit

Prediction

           1            2            3            4            5            6 
9.939504e-01 8.860420e-01 2.689747e-01 1.711402e-02 8.233054e-04 3.899164e-05 

Emmeans

 CCU     prob       SE  df asymp.LCL asymp.UCL
   0 0.993950 0.014243 Inf  6.13e-01    0.9999
   1 0.886042 0.128537 Inf  3.91e-01    0.9895
   2 0.268975 0.147493 Inf  7.80e-02    0.6155
   3 0.017114 0.026259 Inf  8.16e-04    0.2707
   4 0.000823 0.002214 Inf  4.20e-06    0.1387
   5 0.000039 0.000151 Inf  0.00e+00    0.0714

Confidence level used: 0.95 
Intervals are back-transformed from the logit scale 

Prediction plots with emmeans

Exercise

Challenge

Open exercises/ for prompts.

  • Using the malaria dataset fit a binomial glm to look at the predictors of malaria in the Seychelles warbler

  • Check the model fit

  • Make predictions

  • Produce a suitable figure

30:00

See exercise_solutions for full solution.

GLMs for Binomial data

Proportion data and GLM

Sometimes, count data aligns more closely with logistic regression methodologies than it initially appears.

When we measure the number of occurrences out of a known total sample size, we are not looking at typical count data but at proportion data.

Imagine, for example, we’re studying the prevalence of a native underbrush species across various forest plots. We assess 15 different plots, each meticulously surveyed for the presence of this underbrush, counting the number of square meters where the underbrush is present versus absent within a standard area:

\[\frac{M^2~\text{with native underbrush (successes)}}{\text{Total surveyed area in}~M^2~(\text{trials})}\]

Bound between zero and one
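In R, proportion data like this can be passed to glm() as a two-column matrix of successes and failures built with cbind(). A sketch on simulated survey data (all names and values here are illustrative, not from the underbrush study):

```r
# Simulated presence/absence counts: 15 plots, 20 m^2 surveyed per plot
set.seed(7)
n_plots  <- 15
surveyed <- rep(20, n_plots)                 # trials per plot
x        <- runif(n_plots)                   # some plot-level predictor
present  <- rbinom(n_plots, size = surveyed,
                   prob = plogis(-1 + 2 * x))

# Response = cbind(successes, failures)
mod_prop <- glm(cbind(present, surveyed - present) ~ x,
                family = binomial(link = "logit"))
coef(mod_prop)
```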

Binomial distribution

\(logit(p) = \log \frac{p}{1 - p}\)

\(y_i = Binomial(n,p)\)

  • It is the collection of Bernoulli trials for the same event

  • It is represented by the number of Bernoulli trials \(n\)

  • It is also the probability of an event in each trial \(p\)

\(P(X = s) = C(n, s) \cdot p^s \cdot (1 - p)^{n - s}\)

Where:

  • \(n\) is the total number of trials of an event.
  • \(s\) corresponds to the number of times an event should occur.
  • \(p\) is the probability that the event will occur.
  • \((1 - p)\) is the probability that the event will not occur.
  • \(C\) term represents combinations, calculated as \(C(n, s) = \frac{n!}{s!(n - s)!}\), representing the number of ways to choose \(s\) successes out of \(n\) trials.

Example

TEXT


# According to the situation:
# n = 5,
# s = 5,
# p = 0.95,
# (1 - p) = 0.05

p = 1 * (0.95)^(5) * (0.05)^(0)

p
[1] 0.7737809
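The same probability comes straight from R's built-in binomial density:

```r
# P(X = 5) for Binomial(n = 5, p = 0.95): identical to 0.95^5
dbinom(5, size = 5, prob = 0.95)  # 0.7737809
```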

Case study

Binomial VS Bernoulli Keypoints!

Bernoulli describes the outcome of a single trial of an event, whereas Binomial describes the outcomes of multiple trials of the same event.

Use Bernoulli when an event is observed only once; use Binomial when the event is observed multiple times.

Exercise

Weights

Predictions/write-up

Overdispersion

Binomial \(P(X = s) = C(n, s) \cdot p^s \cdot (1 - p)^{n - s}\)

Quasibinomial \(P(X = s) = C(n, s) \cdot p(p+k\theta)^{s-1} \cdot (1 - pk\theta)^{n - s}\)

Where:

  • \(n\) is the total number of trials of an event.

  • \(s\) corresponds to the number of times an event should occur.

  • \(p\) is the probability that the event will occur.

  • \((1 - p)\) is the probability that the event will not occur.

  • \(C\) term represents combinations, calculated as \(C(n, s) = \frac{n!}{s!(n - s)!}\), representing the number of ways to choose \(s\) successes out of \(n\) trials.

  • \(\theta\) term describes additional variance outside of the Binomial distribution

Challenge

GLMs for count data

Poisson

Count or rate data are ubiquitous in the life sciences (e.g. number of parasites per microlitre of blood, number of species counted in a particular area). These types of data are discrete and non-negative.

A useful distribution to model abundance data is the Poisson distribution:

a discrete distribution with a single parameter, λ (lambda), which defines both the mean and the variance of the distribution.

GLM

Recall the simple linear regression case (i.e. a GLM with a Gaussian error structure and identity link). For the sake of clarity, write the mean of \(Y\), \(\mu_i\), as a linear function of the explanatory variables:

\(\mu_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_n x_{ni}\)

\(y_i \sim \mathcal{N}(\mu_i, \sigma^2)\)

  • The mean function is unconstrained, i.e the value of \(\beta_0 + \beta_1X\) can range from \(-\infty\) to \(+\infty\).

  • If we want to model count data we therefore want to constrain this mean to be positive only.

  • Mathematically we can do this by taking the logarithm of the mean (the log is the default link for the Poisson distribution).

  • We then assume our count data variance to be Poisson distributed (a discrete, non-negative distribution), to obtain our Poisson regression model (to be consistent with the statistics literature we will rename \(\mu\) to \(\lambda\)):
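The effect of the log link can be seen numerically: the linear predictor is unconstrained, but exponentiating always returns a strictly positive mean (the coefficient values here are illustrative):

```r
# Illustrative coefficients on the log scale
beta0 <- 0.5
beta1 <- 0.2

x   <- c(-20, 0, 20)
eta <- beta0 + beta1 * x  # linear predictor: can be negative
lam <- exp(eta)           # Poisson mean lambda: always strictly positive

all(lam > 0)
```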

GLM Poisson

Poisson equation

Poisson limitations

The Poisson distribution has the following characteristics:

  • Discrete variable (integer), defined on the range \(0, 1, \dots, \infty\).

  • A single rate parameter \(\lambda\), where \(\lambda > 0\).

  • Mean = \(\lambda\)

  • Variance = \(\lambda\)

Poisson: Case study

Cuckoo

In a study by Kilner et al. (1999), the authors studied the begging rate of nestlings in relation to total mass of the brood of reed warbler chicks and cuckoo chicks. The question of interest is:

How does nestling mass affect begging rates between the different species?

This model is inadequate. It is predicting negative begging calls within the range of the observed data, which clearly does not make any sense.

cuckoo_lm <- lm(Beg ~ Mass + Species + Mass:Species, data = cuckoo)

Diagnostics

Let us display the model diagnostics plots for this linear model.

  • Curvature

  • Funnelling effect

Biological data and distributions

fitdistrplus?

Poisson model

We should therefore try a different model structure.

The response variable in this case is classic count data: discrete and bounded below by zero (i.e. we cannot have negative counts). We will therefore try a Poisson model using the canonical log link function for the mean:

  • \(\lambda\) varies with x (mass), which means residual variance will also vary with x, which means that we just relaxed the homogeneity of variance assumption!

  • The response is modelled as discrete counts (integers), not fractions.

  • The model will never predict negative values (the Poisson mean is strictly positive).

\[ \log{\lambda} = \beta_0 + \beta_1 M_i + \beta_2 S_i + \beta_3 M_i S_i \]

where \(M_i\) is nestling mass and \(S_i\) a dummy variable

\(S_i = \left\{\begin{array}{ll}1 & \text{if } i \text{ is warbler},\\0 & \text{otherwise}.\end{array}\right.\)

The term \(M_iS_i\) is an interaction term. Think of this as an additional explanatory variable in our model. Effectively it lets us have different slopes for different species (without an interaction term we assume that both species have the same slope for the relationship between begging rate and mass, and only the intercept differ).

Regression lines

The mean regression lines for the two species look like this:

  • Cuckoo (\(S_i=0\))

  • \(\begin{aligned}\log{\lambda} & = \beta_0 + \beta_1 M_i + (\beta_2 \times 0) + (\beta_3 \times M_i \times 0)\\\log{\lambda} & = \beta_0 + \beta_1 M_i\end{aligned}\)

  • Warbler (\(S_i=1\))

  • \(\begin{aligned}\log{\lambda} & = \beta_0 + \beta_1 M_i + (\beta_2 \times 1) + (\beta_3 \times M_i \times 1)\\\log{\lambda} & = \beta_0 + \beta_1 M_i + \beta_2 + \beta_3M_i\\\end{aligned}\)

Fit the Poisson model

Fit the model with the interaction term in R:

cuckoo_glm1 <- glm(Beg ~ Mass + Species + Mass:Species, data=cuckoo, family=poisson(link="log"))

summary(cuckoo_glm1)

Call:
glm(formula = Beg ~ Mass + Species + Mass:Species, family = poisson(link = "log"), 
    data = cuckoo)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)          0.334475   0.227143   1.473  0.14088    
Mass                 0.094847   0.007261  13.062  < 2e-16 ***
SpeciesWarbler       0.674820   0.259217   2.603  0.00923 ** 
Mass:SpeciesWarbler -0.021673   0.008389  -2.584  0.00978 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 970.08  on 50  degrees of freedom
Residual deviance: 436.05  on 47  degrees of freedom
AIC: 615.83

Number of Fisher Scoring iterations: 6

Note the negative interaction effect for Mass:Species, indicating that begging calls increase with mass less steeply for warblers than for cuckoos.

Exercise

Open exercises/exercise_04.R for prompts.

  • Fit a Poisson model to the cuckoo data

  • Look at the residual plots - how have they changed?

10:00

Model summary

Parameter estimates

Deviance

Poisson model check

Mean-centering

Model selection

Likelihood ratio tests, AIC - exercise/reminder?

Exercise

Look at the effect of mean-centering on estimates

Offset - for rates


Call:
glm(formula = Call_Total ~ mass.c + Species + mass.c:Species, 
    family = poisson(link = "log"), data = cuckoo, offset = log(Mins))

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            2.249441   0.030831  72.961   <2e-16 ***
mass.c                 0.129261   0.002502  51.671   <2e-16 ***
SpeciesWarbler         0.300202   0.035330   8.497   <2e-16 ***
mass.c:SpeciesWarbler -0.057893   0.002829 -20.462   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 9597.9  on 50  degrees of freedom
Residual deviance: 3071.8  on 47  degrees of freedom
AIC: 3334.6

Number of Fisher Scoring iterations: 6

Poisson limitations

So for the Poisson regression case we assume that the mean and variance are equal. Hence, as the mean increases, the variance increases also (heteroscedasticity). This may or may not be a sensible assumption, so watch out! Just because a Poisson distribution usually fits count data well doesn't mean that a Gaussian distribution can never work.

Recall the link function between the predictors and the mean and the rules of logarithms (if \(\log{\lambda} = k\)(value of predictors), then \(\lambda = e^k\)):

Overdispersion

When the residual deviance is substantially higher than the residual degrees of freedom, we say that the model is overdispersed. The dispersion statistic is:

\[ \phi = \frac{\text{Residual deviance}}{\text{Residual degrees of freedom}} \]

Overdispersion occurs when the variance in the data is higher than the mean, indicating that the Poisson distribution might not be the best choice. This can happen for many reasons, such as the presence of many zeros, many very high values, or covariates missing from the model.
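The dispersion statistic is easy to compute from any fitted glm object. Here is a sketch on simulated overdispersed counts (negative binomial draws deliberately fitted with a Poisson GLM; all values illustrative):

```r
# Simulate overdispersed counts and fit an (inappropriate) Poisson GLM
set.seed(1)
x <- runif(200)
y <- rnbinom(200, mu = exp(1 + 2 * x), size = 1)  # variance >> mean
mod <- glm(y ~ x, family = poisson(link = "log"))

# Dispersion statistic: residual deviance / residual degrees of freedom
phi <- deviance(mod) / df.residual(mod)
phi  # values well above 1 flag overdispersion
```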

Solutions

Overdispersion statistic values > 1

| Causes of over-dispersion | What to do about it |
|---------------------------|---------------------|
| Model mis-specification (missing covariates or interactions) | Add more covariates or interaction terms |
| Too many zeros ("zero inflation") | Use a zero-inflated model |
| Non-independence of observations | Use a generalised mixed model with random effects to take non-independence into account |
| Variance is larger than the mean | Use a quasipoisson GLM if overdispersion = 2-15; use a negative binomial GLM if > 15-20 |

Quasi-Poisson GLM

The systematic part and the link function remain the same

\[ \log{\lambda} = \beta_0 + \beta_1 x_i \]

\(\phi\) is a dispersion parameter; it will not affect the estimates but will affect significance, since standard errors are multiplied by \(\sqrt\phi\):

\[ \text{Var}(Y_i) = \phi \lambda_i \]

Where:

  • \(Y_i\) is the response variable.
  • \(\phi\) is the dispersion parameter, which adjusts the variance of the distribution

Confidence intervals and p-values WILL change.

Exercise

Open exercises/exercise_04.R for prompts.

  • Fit a quasipoisson model to the cuckoo data or update the previous one
  • Look at the residual plots - how have they changed?

  • Look at the SE and p-values - how have they changed

  • Calculate new 95% confidence intervals

Summary

Deviance analysis

F distributions

The presence of overdispersion suggested the use of the F-test for nested models. We will test if the squared term can be dropped from the model.

https://towardsdatascience.com/the-most-important-statistical-test-dee01f4d50cf

Zero-inflation

“Zero-inflation” refers to the problem of “too many zeros”.

The dependent variable contains more zeros than would be expected under the standard statistical distribution for zero-bounded count data (Poisson or negative binomial).

Zero-inflation is a common feature of count datasets. Large numbers of zeros can arise because

  1. There was nothing there to count (true zeros)

  2. Things were there but missed (false zeros)

Zero-inflation

| Model | Acronym | Source of Overdispersion |
|-------|---------|--------------------------|
| Zero-inflated Poisson | ZIP | Zeros |
| Negative Binomial | NegBin | Positive integers |
| Zero-inflated Negative Binomial | ZINB | Zeros + Positive integers |

Zip

ZIP in R

In a ZIP model we define two halves of the model; the first half models the count process (i.e. a Poisson GLM explaining non-zero integers) and the second half models the binomial (i.e. explains the zeros).

install.packages("pscl")
library(pscl)

zip1 <- zeroinfl(y ~ x1 + x2 | x1 + x2,
                 dist = "poisson",
                 link = "logit",
                 data = dataframe)

Alternative model design

You can change the zero-inflation model’s link function and use AIC to decide on the best model

# zeros modelled with a constant (intercept only)
y ~ x1 + x2 | 1

# zeros modelled with the same variables
y ~ x1 + x2 | x1 + x2

# zeros modelled with different variables
y ~ x1 + x2 | z1 + z2

Checking for zero-inflation

RETURN TO CUCKOO DATA!

Fit the model

AIC

Diagnostics

qqnorm(sresid)
qqline(sresid, col = "red")

Negative Binomial

Negative Binomial

Negative binomial GLMs are favored when overdispersion is high.

It has two parameters, \(\mu\) and \(k\); \(k\) controls the dispersion (smaller \(k\) indicates higher dispersion). It corresponds to a combination of two distributions (Poisson and gamma).

It assumes that the \(Y_i\) are Poisson distributed with the mean \(μ_i\) assumed to follow a gamma distribution:

\[ E(Y_i) = μ_i \\ \text{Var}(Y_i) = μ_i + μ_i^2/k \]

Fitting a negative binomial in R

The negative binomial is not available as a family in the glm() function.
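MASS::glm.nb() fits it instead, estimating the dispersion parameter \(k\) (reported as theta) by maximum likelihood. A sketch on simulated counts (all values illustrative):

```r
library(MASS)

# Simulated overdispersed counts
set.seed(42)
x <- runif(200)
y <- rnbinom(200, mu = exp(1 + x), size = 2)

mod_nb <- glm.nb(y ~ x)
mod_nb$theta  # estimated k; smaller values indicate more dispersion
```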

Exercise

Challenge

Other distributions

Gamma

Second column

Simulate

Survival

Wrap-up

Where next?



  • (Generalised) Linear Mixed Models
  • Generalised Additive Models

Live demo!



See examples/example_04.R for full code.

Exercise 4

Open exercises/exercise_04.R for prompts.

  • Look at the ACF and PACF plots of the residuals from your previous GAM.

  • Try fitting a GAMM model instead.

10:00

See exercise_solutions/exercise_solutions_04.R for full solution.

Additional resources

  • GLMs resource list:

  • Discovering Statistics - Andy Field

  • An Introduction to Generalized Linear Models - Dobson & Barnett

  • An Introduction to Statistical Learning with Applications in R - James, Witten, Hastie & Tibshirani

  • Mixed Effects Models and Extensions in Ecology with R - Zuur, et al.